This is a study note for using \(lmerTest\) package for Linear Mixed-Effects Models/Regression (LMER).
The modeling framwork for random effect is Linear Mixed-Effects Models, or the following equivalent name in different disciplines:
Mixed Effects Model, Random Effects Model, Linear Mixed ModelsHierarchical Linear ModelMultilevel ModelRandom Coefficients Regression ModelGrowth Curve Model(source: Xie (2010) Regression Analysis (Chinese), CH16)
Let’s quickly introduce random effect and fixed effect then explain why need to use LMM to deal with nest data.
Nested data: Level of one random effect only occur within one unique level of the higher. It often occurs in the design of experiments and in particular in the context of randomized experiments and randomized controlled trials. wikiWithout a random effect for each experimental unit, a one-to-one correspondence between observations and experimental units is assumed. Including random effects in a model is one way to account for a lack of independence among observations that might be expected based on the design of experiment. Here is the definitions of fixed effect and random effect:
Fixed effect: test for effect of parametersRandom effect: control for non-independence from nested or hierarchical structure. i.e. blocking, replicated sampling from individuals, a subset of many possible group that could be sampled.In probability and statistics, Simpson's paradox refer to a trend appears in several different groups of data but disappears or reverses when these groups are combined. This result is often encountered in social-science and medical-science statistics and is particularly problematic when frequency data is unduly given causal interpretations. When not consider level for data where samples are not independent with levels, problems can be resulted in within-group regression, Between-group regression, and Total regression, shown in the folowing:
When imporperly consider consider level in a model, there two kind of statisical fallaies may occur:
Ecological Fallacy: a formal fallacy in the interpretation of statistical data that occurs when inferences about the nature of individuals are deduced from inferences about the group to which those individuals belong. It could occur when when consider only cluster (lv.2) as the independent variableAtomistic fallacy: the fallacy of drawing inferences regarding variability across units defined at a higher level based on data collected for units at a lower level. It could occur when consider only individual samples (lv.1) as the dependent variableLinear Mixed-Effects Models assume that the data being analysed are drawn from a hierarchy of different populations whose differences relate to that hierarchy.
partial polling or shrinkage. (source: https://www.youtube.com/watch?v=QCqF-2E86r0 )| Reference | Minimum ratio of Group size/Sample size |
|---|---|
| Kreft (1996) | 30/30 |
| Hox (1998) | 100/10 |
The ICC, or Intraclass Correlation Coefficient, can be very useful in many statistical situations, but especially so in Linear Mixed Models. The definition of ICC is the following:
\[ ICC = \rho = \frac{IntraclassVariance}{TotalVariance }\]
Why ICC is useful?
(Source: https://www.theanalysisfactor.com/the-intraclass-correlation-coefficient-in-mixed-models/)
Luke (2004)’s idea of whether using LMM:
| Intraclass correlation | ICC | if Applying LMM |
|---|---|---|
| low | ICC < 0.059 | No neccessary |
| moderate | 0.059 < ICC < 0.138 | Yes |
| high | ICC > 0.139 | Yes |
ICC for unconditional and conditional models:
(source: https://www.rdocumentation.org/packages/sjstats/versions/0.17.4/topics/icc)
In R, both \(lmer\) and \(lmerTest\) package can model Linear mixed-Effects models/regression. We will use \(lmerTest\) package because it can output significance of fixed effect.
fit = lmerTest::lmer(data = , formula = DV ~ Fixed_Factor_1 + (Random_intercept + Random slope | Random_Factor_1), ...)
Variable:
DV: dependent variable;Fixed_Factor: Level_1 independent variables as fixed effect. It can also include interaction between independent variables at level 1Random_Factor - Level_2 catergorical variable indicating groups or cluster as random factor.Random_Slope - the Level_1 independent variable that has different effect on Level 2 variableRandom_intercept: the intercpet in Level_2.Symbol:
~, +: a two-sided linear formula object describing both the fixed-effects and randomeffects part of the model, with the response on the left of a ~ operator and the terms, separated by + operators, on the right.|: Random-effects terms are distinguished by vertical bars (|) separating expressions for design matrices from grouping factors.||: Two vertical bars (||) can be used to specify multiple uncorrelated random effects for the same grouping variable. (Because of the way it is implemented, the ||-syntax works only for design matrices containing numeric (continuous) predictors; to fit models with independent categorical effects, see dummy or the lmer_alt function from the afex package.)(source: https://github.com/usplos/Eye-movement-related/blob/master/Linear%20mixed%20model%20(Hierarchical%20linear%20model)%20using%20R%20and%20JAMOVI.md; https://cran.r-project.org/web/packages/lmerTest/lmerTest.pdf)
| Coefficient | R formula | Simplified form |
|---|---|---|
| Fix intercept | lm(Y ~ 1 + X1 + X2, …) | lm(Y ~ X1 + X2, …) |
| Random intercept | lmer(Y ~ 1 + X1 + X2 + (1 + 1| group), …) | lmer(Y ~ X1 + X2 + (1| group), …) |
| Fixed slope | lmer(Y ~ 1 + X1 + X2 + (1 + 1 | group), …) | lmer(Y ~ X1 + X2 + (1 | group), …) |
| Random slope | lmer(Y ~ 1 + X1 + X2 + (1 + X1 | group), …) | lmer(Y ~ X1 + X2 + (X1 | group), …) |
(souce: https://zhuanlan.zhihu.com/p/63092231 )
when talk about LMM, Random intercept is often applied. So only Fixed slope and Random slope are considered.
| Model category | Meaning | Formula | Equivalent form |
|---|---|---|---|
| Fixed slope model (nullmodel, interceptonlymodel, unconditionalmodel) | Random intercept with fixed mean | lmer(Y ~ (1 | g), …) | lmer(Y ~ 1+ (1 | g), …) |
| Fixed slope model | Random intercept with a priori means | lmer(Y ~ 0 + offset(o) + (1 | g), …) | lmer(Y ~ -1 + offset(o) + (1 | g), …) |
| Fixed slope model (crossed random effect) | Intercept varying among g1 and g2 | lmer(Y ~ (1 | g1) + (1 | g2), …) | - |
| Fixed slope model (3-level nested model) | Intercept varying among g1 (lv.3) and g2 (lv.2) within g1 (lv.3) | lmer(Y ~ (1 | g1 / g2), …) | lmer(Y ~ (1 | g1) + (1 | g1: g2), …) |
| Random slope model | Uncorrelated random intercept and slope | lmer(Y ~ X + ( X || g), …) | lmer(Y ~ X + (1 | g) + (0 + X | g), …) |
| Random intercepts and slopes model (recommended if possible) | Correlated random intercept and slope | lmer(Y ~ X + ( X | g), …) | - |
(source: https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf )
(crossed random effect, 3-level nested model, source: https://www.youtube.com/watch?v=QCqF-2E86r0 )
Not enought degree of freedom: When number of observation is less than number of random effect, then model cannot converge. Solution 1: simplify the model (i.e. removing the problematic random slopes) Solution 2: increase the number of iteration
fit = lmerTest::lmer(data, formula, control = lmerControl(optCtrl = list(maxfun = 20000)))
Singular fit: the data cannot support the complex random effect structure? Solution: remove the most complex term in the random effect (i.e. random slope)
Work flow of LMM analysis:
sjstats::icc() to calculate the intraclass Correlation Coefficient using the null model, which is used to determine if intraclass variable is large enought to apply LMM# nullmodel = lmerTest::lmer(DV ~ 1+ (1|g), ...)
# sjstats::icc(nullmodel)
summary() to view the the variance of random effect;lmerTest::ranova() to produce a ANOVA-like table for random effects to test significance of Lv.2 variables. It checks whether the model becomes significantly worse if a certain random effect is dropped (formally known as likelihood ratio tests), if this is not the case, the random effect is not significant.# summary(fit1)
# lmerTest::ranova(fit1)
Similar to the GLM/OLS, use anova() to check if the main effects and interaction effect is significant in F-tests.
# anova(fit1)
In analysis with multilevel (i.e. at least two-level), two types of cross-level interactions could occur between the higher and lower level:
Compositional effects: inter-group (or inter-context) differences in an outcome (for example, disease rates) are attributable to differences in group composition (that is, in the characteristics of the individuals of which the groups are comprised)Contextual effects: intra-group differences are attributable to the effects of group level variable or properties.(source: https://jech.bmj.com/content/56/8/588)
Effect size is the ratio of the variance of residual between improved model and the base model. For example, the effect size * 100% refers to the percentage of imporvement by adding variables into the new model.
\[ ES = \frac{variance_{base} - variance_{new}}{variance_{base}} \]
| Effect size | Improvement |
|---|---|
| 0.2 ~ 0.15 | weak |
| 0.15 ~ 0.35 | moderate |
| > 0.35 | strong |
In many research area, LMM is usually applied to the following common randomized statistical experiments The Depanding variable, Random effect, and Fixed effect will be listed out for type of experiments, respectively. During the model selection, it is important to consider whether:
The main advantage of this design is randomization. The post-test comparison with randomized subjects controls for the main effects of history, maturation, and pre-testing; because no pre-test is used there can be no interaction effect of pre-test and X. Another advantage of this design is that it can be extended to include more than Controlled if necessary.
Completely Randomized Design
The RCBD is the standard design for agricultural experiments where similar experimental units are grouped into blocks or replicates.
Randomized Complete Block Design
Like a randomized complete block design (RCBD), a GRBD is randomized. Within each block, treatments are randomly assigned to experimental units: this randomization is also independent between blocks. In a (classic) RCBD, however, there is no replication of treatments within blocks.
library(lmerTest)
library(ggplot2)
library(readr)
library(dplyr)
library(kableExtra)
library(lme4)
The data source is Health Insurance Exchange Public Use Files
# packages
## system
import os, warnings, sys, pathlib, glob
from pathlib import Path
warnings.filterwarnings('ignore')
## data structure
import pandas as pd
## utils
import zipfile
## graphing
import librosa.display
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.collections import PatchCollection
from matplotlib.patches import Rectangle
%matplotlib inline
# create temporary location
directory_to_extract_to = './tmp_PUF'
if not os.path.isdir(directory_to_extract_to):
os.mkdir(directory_to_extract_to)
# unzip function
def unzip_file(path_to_zip_file, directory_to_extract_to):
with zipfile.ZipFile(path_to_zip_file[0], 'r') as zip:
# printing all the contents of the zip file
zip.printdir()
zip.extractall(path = directory_to_extract_to)
# looping over year
rate = pd.DataFrame()
for year in range(2014,2021):
# for year in range(2014,2016):
# locate .zip file
zipfile_name = 'rate_puf'
zipfile_name_y_zip = zipfile_name +'_'+ str(year) + '.zip'
zipfile_name_y_csv = directory_to_extract_to + '/' + zipfile_name + '.csv'
path_to_zip_file = list(Path('.').rglob(zipfile_name_y_zip))
# unzipping
unzip_file(path_to_zip_file, directory_to_extract_to)
tmp_r = pd.read_csv(directory_to_extract_to + '/' + zipfile_name +'.csv').drop(columns = ['RowNumber', 'IssuerId2', 'VersionNum'],
errors='ignore')
tmp_r = tmp_r.melt(tmp_r.columns[:12],
var_name='Plan',
value_name='Rate').dropna()
# store data
rate = pd.concat([rate, tmp_r])
print('update DataFrame shape:',rate.shape)
# clear tmp folder
files = glob.glob(directory_to_extract_to +'/*')
for f in files:
os.remove(f)
print("Removed:" + f)
print('Final dataframe:', rate.shape, rate.columns)
# add 'PrimarySubscriber' variable
rate.loc[(rate['Plan'] =='IndividualRate') ,'PrimarySubscriber'] = 'Individual'
rate.loc[(rate['Plan'] =='IndividualTobaccoRate') ,'PrimarySubscriber'] = 'Individual'
rate.loc[(rate['Plan'] =='PrimarySubscriberAndOneDependent') ,'PrimarySubscriber'] = 'Individual'
rate.loc[(rate['Plan'] =='PrimarySubscriberAndTwoDependents') ,'PrimarySubscriber'] = 'Individual'
rate.loc[(rate['Plan'] =='PrimarySubscriberAndThreeOrMoreDependents') ,'PrimarySubscriber'] = 'Individual'
rate.loc[(rate['Plan'] =='Couple') ,'PrimarySubscriber'] = 'Couple'
rate.loc[(rate['Plan'] =='CoupleAndOneDependent') ,'PrimarySubscriber'] = 'Couple'
rate.loc[(rate['Plan'] =='CoupleAndTwoDependents') ,'PrimarySubscriber'] = 'Couple'
rate.loc[(rate['Plan'] =='CoupleAndThreeOrMoreDependents') ,'PrimarySubscriber'] = 'Couple'
# add 'Dependents' variable
rate.loc[(rate['Plan'] =='IndividualRate') ,'Dependents'] = 'Zero'
rate.loc[(rate['Plan'] =='IndividualTobaccoRate') ,'Dependents'] = 'Zero'
rate.loc[(rate['Plan'] =='PrimarySubscriberAndOneDependent') ,'Dependents'] = 'One'
rate.loc[(rate['Plan'] =='PrimarySubscriberAndTwoDependents') ,'Dependents'] = 'Two'
rate.loc[(rate['Plan'] =='PrimarySubscriberAndThreeOrMoreDependents') ,'Dependents'] = 'ThreeOrMore'
rate.loc[(rate['Plan'] =='Couple') ,'Dependents'] = 'Zero'
rate.loc[(rate['Plan'] =='CoupleAndOneDependent') ,'Dependents'] = 'One'
rate.loc[(rate['Plan'] =='CoupleAndTwoDependents') ,'Dependents'] = 'Two'
rate.loc[(rate['Plan'] =='CoupleAndThreeOrMoreDependents') ,'Dependents'] = 'ThreeOrMore'
# update 'Tobacco' variable
rate.loc[(rate['Plan'] =='IndividualRate') & (rate['Tobacco'] == 'Tobacco User/Non-Tobacco User') ,'Tobacco'] = 'Non-Tobacco User'
rate.loc[(rate['Plan'] =='IndividualTobaccoRate') & (rate['Tobacco'] == 'Tobacco User/Non-Tobacco User') ,'Tobacco'] = 'Tobacco User'
print(rate['PrimarySubscriber'].unique(), rate['Dependents'].unique(), rate['Tobacco'].unique())
# store the cleaned dataset
rate = rate.drop(columns = ['Plan', 'SourceName', 'ImportDate', 'FederalTIN', 'RateEffectiveDate', 'RateExpirationDate', 'PlanId'], errors='ignore')
rate = rate[['IssuerId', 'BusinessYear', 'StateCode', 'RatingAreaId', 'Tobacco', 'Age', 'PrimarySubscriber', 'Dependents', 'Rate']].reset_index(drop=True)
rate.to_csv('rate.csv', index=False)
# read in the data if need
# rate = pd.read_csv('rate.csv')
rate.info()
rate.describe().T
# high pay insurance
rate.loc[(rate['Rate'] > 9000.0), ]
sns.distplot(rate['Rate'][rate['Rate'] > 9000.0])
rate = pd.read_csv('rate.csv')
rate.sample(n=100000, random_state=1).to_csv('rate_100000.csv', index=False)
# input data
rate <- readr::read_csv("rate_100000.csv")
rate$ifFamilyOption <- if_else(rate$Age == 'Family Option', 'Y', 'N')
rate$ifDependents <- if_else(rate$Dependents == 'Zero', 'N', 'Y')
rate$ifTobaccoPreference <- if_else(rate$Tobacco == 'No Preference', 'N', 'Y')
rate$Age <- plyr::revalue(rate$Age, c("Family Option"="0", "65 and over"="64", "64 and over"="64")) %>% as.numeric() # %>% as.factor()
rate$Dependents <- plyr::revalue(rate$Dependents, c("ThreeOrMore"="3", "Two"="2", "Zero"="0", "One"="1")) %>% as.numeric() # %>% as.factor()
rate <- rate %>%
mutate(Rate = Rate +1) %>%
mutate_if(is.character, as.factor) %>%
mutate(IssuerId = IssuerId %>% as.factor())
rate <- rate %>% filter(Rate < 7000)
rate %>% str()
## tibble [99,656 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ IssuerId : Factor w/ 968 levels "10064","10091",..: 610 63 426 426 391 490 191 807 52 63 ...
## $ BusinessYear : num [1:99656] 2015 2014 2016 2015 2020 ...
## $ StateCode : Factor w/ 40 levels "AK","AL","AR",..: 15 6 32 32 2 6 28 38 6 6 ...
## $ RatingAreaId : Factor w/ 67 levels "Rating Area 1",..: 56 4 21 35 45 9 8 4 6 41 ...
## $ Tobacco : Factor w/ 3 levels "No Preference",..: 3 1 2 3 3 1 3 1 1 2 ...
## $ Age : num [1:99656] 61 24 61 31 54 48 53 38 0 36 ...
## $ PrimarySubscriber : Factor w/ 2 levels "Couple","Individual": 2 2 2 2 2 2 2 2 2 2 ...
## $ Dependents : num [1:99656] 0 0 0 0 0 0 0 0 0 0 ...
## $ Rate : num [1:99656] 987 322 801 407 693 ...
## $ ifFamilyOption : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 2 1 ...
## $ ifDependents : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ ifTobaccoPreference: Factor w/ 2 levels "N","Y": 2 1 2 2 2 1 2 1 1 2 ...
## - attr(*, "spec")=
## .. cols(
## .. IssuerId = col_double(),
## .. BusinessYear = col_double(),
## .. StateCode = col_character(),
## .. RatingAreaId = col_character(),
## .. Tobacco = col_character(),
## .. Age = col_character(),
## .. PrimarySubscriber = col_character(),
## .. Dependents = col_character(),
## .. Rate = col_double()
## .. )
par(mfrow=c(1,2))
hist(rate$Rate)
hist(rate$Age)
boxplot(Rate ~ Age, data = rate)
boxplot(Rate ~ ifFamilyOption, data = rate)
boxplot(Rate ~ IssuerId, data = rate)
boxplot(Rate ~ BusinessYear, data = rate)
boxplot(Rate ~ StateCode, data = rate)
boxplot(Rate ~ RatingAreaId, data = rate)
boxplot(Rate ~ Tobacco, data = rate)
boxplot(Rate ~ ifTobaccoPreference, data = rate)
plot(Dependents ~ PrimarySubscriber, data = rate)
boxplot(Rate ~ PrimarySubscriber, data = rate)
boxplot(Rate ~ Dependents, data = rate)
boxplot(Rate ~ ifDependents, data = rate)
I construct the null model by testing any combination of the following three variables. Notice that IssuerId and StateCode might hive nested relation.
null_model_lv2_v1 <- lmer(Rate ~ (1|RatingAreaId),
data = rate) ; summary(null_model_lv2_v1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ (1 | RatingAreaId)
## Data: rate
##
## REML criterion at convergence: 1448766
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.3920 -0.9682 -0.1129 0.5240 14.2853
##
## Random effects:
## Groups Name Variance Std.Dev.
## RatingAreaId (Intercept) 1625 40.31
## Residual 120365 346.94
## Number of obs: 99656, groups: RatingAreaId, 67
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 377.537 5.335 54.365 70.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmerTest::ranova(null_model_lv2_v1)
## ANOVA-like table for random-effects: Single term deletions
##
## Model:
## Rate ~ (1 | RatingAreaId)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 3 -724383 1448772
## (1 | RatingAreaId) 2 -724514 1449032 261.98 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
null_model_lv2_v2 <- lmer(Rate ~ (1|StateCode),
data = rate) ; summary(null_model_lv2_v2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ (1 | StateCode)
## Data: rate
##
## REML criterion at convergence: 1443648
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.0982 -0.7837 -0.1526 0.5140 14.6220
##
## Random effects:
## Groups Name Variance Std.Dev.
## StateCode (Intercept) 6859 82.82
## Residual 114325 338.12
## Number of obs: 99656, groups: StateCode, 40
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 384.26 13.28 39.43 28.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmerTest::ranova(null_model_lv2_v2)
## ANOVA-like table for random-effects: Single term deletions
##
## Model:
## Rate ~ (1 | StateCode)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 3 -721824 1443655
## (1 | StateCode) 2 -724514 1449032 5379.6 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# fail to converge
null_model_lv3_v1 <- lmer(Rate ~ (1|StateCode/RatingAreaId),
data = rate) ; summary(null_model_lv3_v1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ (1 | StateCode/RatingAreaId)
## Data: rate
##
## REML criterion at convergence: 1442638
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.2684 -0.7486 -0.1674 0.5097 14.6686
##
## Random effects:
## Groups Name Variance Std.Dev.
## RatingAreaId:StateCode (Intercept) 1806 42.49
## StateCode (Intercept) 6602 81.25
## Residual 112520 335.44
## Number of obs: 99656, groups: RatingAreaId:StateCode, 438; StateCode, 40
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 380.30 13.34 38.69 28.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## convergence code: 0
## Model failed to converge with max|grad| = 0.00227963 (tol = 0.002, component 1)
lmerTest::ranova(null_model_lv3_v1)
## ANOVA-like table for random-effects: Single term deletions
##
## Model:
## Rate ~ (1 | RatingAreaId:StateCode) + (1 | StateCode)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 4 -721319 1442646
## (1 | RatingAreaId:StateCode) 3 -721824 1443655 1010.6 1 < 2.2e-16 ***
## (1 | StateCode) 3 -721518 1443041 397.4 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# the best null model
null_model_lv2_v3<- lmer(Rate ~ (1|IssuerId),
data = rate) ; summary(null_model_lv2_v3)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ (1 | IssuerId)
## Data: rate
##
## REML criterion at convergence: 1386404
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.8126 -0.5543 -0.0405 0.1829 18.3573
##
## Random effects:
## Groups Name Variance Std.Dev.
## IssuerId (Intercept) 61589 248.2
## Residual 62420 249.8
## Number of obs: 99656, groups: IssuerId, 968
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 241.758 8.363 984.955 28.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lmerTest::ranova(null_model_lv2_v3)
## ANOVA-like table for random-effects: Single term deletions
##
## Model:
## Rate ~ (1 | IssuerId)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 3 -693202 1386410
## (1 | IssuerId) 2 -724514 1449032 62624 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# singular
null_model_lv3_v2<- lmer(Rate ~ (1|IssuerId) + (1|StateCode/RatingAreaId),
data = rate) ; summary(null_model_lv3_v2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ (1 | IssuerId) + (1 | StateCode/RatingAreaId)
## Data: rate
##
## REML criterion at convergence: 1386094
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9067 -0.5533 -0.0623 0.1843 18.4046
##
## Random effects:
## Groups Name Variance Std.Dev.
## IssuerId (Intercept) 61703.1 248.40
## RatingAreaId:StateCode (Intercept) 500.4 22.37
## StateCode (Intercept) 0.0 0.00
## Residual 61981.8 248.96
## Number of obs: 99656, groups:
## IssuerId, 968; RatingAreaId:StateCode, 438; StateCode, 40
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 241.798 8.467 1024.318 28.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## convergence code: 0
## boundary (singular) fit: see ?isSingular
lmerTest::ranova(null_model_lv3_v2)
## ANOVA-like table for random-effects: Single term deletions
##
## Model:
## Rate ~ (1 | IssuerId) + (1 | RatingAreaId:StateCode) + (1 | StateCode)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 5 -693047 1386104
## (1 | IssuerId) 4 -721319 1442646 56544 1 <2e-16 ***
## (1 | RatingAreaId:StateCode) 4 -693202 1386412 310 1 <2e-16 ***
## (1 | StateCode) 4 -693047 1386102 0 1 0.9998
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
list(null_model_lv2_v1, null_model_lv2_v2, null_model_lv3_v1, null_model_lv2_v3, null_model_lv3_v2) %>%
tibble() %>%
setNames('nullmodel') %>%
mutate(formula = nullmodel %>%
purrr::map_chr(. %>% .@call %>% .[2] %>% as.character() )) %>%
mutate(lv2_variance = nullmodel %>%
purrr::map_dbl(. %>% insight::get_variance_residual() %>% .[[1]]%>% as.numeric() )) %>%
mutate(residual_variance = nullmodel %>%
purrr::map_dbl(. %>% insight::get_variance_intercept() %>% as.numeric() %>% sum() )) %>%
mutate(total_variance = lv2_variance + residual_variance) %>%
mutate(ICC_adjusted = nullmodel %>%
purrr::map_dbl(. %>% performance::icc() %>% .[[1]]%>% as.numeric() )) %>%
select(-nullmodel) %>%
kable() %>% kable_styling()
| formula | lv2_variance | residual_variance | total_variance | ICC_adjusted |
|---|---|---|---|---|
| Rate ~ (1 | RatingAreaId) | 120364.73 | 1624.755 | 121989.5 | 0.0133188 |
| Rate ~ (1 | StateCode) | 114324.65 | 6859.487 | 121184.1 | 0.0566038 |
| Rate ~ (1 | StateCode/RatingAreaId) | 112519.93 | 8408.009 | 120927.9 | 0.0695291 |
| Rate ~ (1 | IssuerId) | 62419.51 | 61588.629 | 124008.1 | 0.4966499 |
| Rate ~ (1 | IssuerId) + (1 | StateCode/RatingAreaId) | 61981.82 | 62203.556 | 124185.4 | NA |
The ICC show that IssuerId has strong effect on the parameter, and StateCode and StateCode/RatingAreaId has moderate effect. so I decide to use the three corresponding null model as the base model.
I am going to add fixed slopes to the models by including the following variables:
# adding fixed effect
# not much impovement
fixslope_LMM_lv3_v1 <- lmer(Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1|IssuerId) + (1|StateCode/RatingAreaId), data = rate)
summary(fixslope_LMM_lv3_v1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents +
## BusinessYear + (1 | IssuerId) + (1 | StateCode/RatingAreaId)
## Data: rate
##
## REML criterion at convergence: 1316459
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.8065 -0.5815 -0.0928 0.4768 21.8251
##
## Random effects:
## Groups Name Variance Std.Dev.
## IssuerId (Intercept) 4.773e+04 2.185e+02
## RatingAreaId:StateCode (Intercept) 5.333e+02 2.309e+01
## StateCode (Intercept) 8.313e-05 9.118e-03
## Residual 4.121e+04 2.030e+02
## Number of obs: 97489, groups:
## IssuerId, 966; RatingAreaId:StateCode, 438; StateCode, 40
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) -5.434e+04 8.448e+02 9.724e+04 -64.32 <2e-16
## Age 1.008e+01 4.861e-02 9.656e+04 207.31 <2e-16
## ifTobaccoPreferenceY 1.120e+02 2.337e+00 9.653e+04 47.91 <2e-16
## PrimarySubscriberIndividual -1.459e+02 1.104e+01 9.699e+04 -13.21 <2e-16
## ifDependentsY 3.099e+02 9.725e+00 8.957e+04 31.86 <2e-16
## BusinessYear 2.692e+01 4.190e-01 9.724e+04 64.26 <2e-16
##
## (Intercept) ***
## Age ***
## ifTobaccoPreferenceY ***
## PrimarySubscriberIndividual ***
## ifDependentsY ***
## BusinessYear ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Age ifTbPY PrmrSI ifDpnY
## Age -0.069
## ifTbccPrfrY -0.023 0.007
## PrmrySbscrI 0.009 -0.065 -0.003
## ifDepndntsY -0.057 0.153 0.009 0.487
## BusinessYer -1.000 0.068 0.022 -0.022 0.049
## convergence code: 0
## boundary (singular) fit: see ?isSingular
lmerTest::ranova(fixslope_LMM_lv3_v1)
## ANOVA-like table for random-effects: Single term deletions
##
## Model:
## Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents +
## BusinessYear + (1 | IssuerId) + (1 | RatingAreaId:StateCode) +
## (1 | StateCode)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 10 -658230 1316479
## (1 | IssuerId) 9 -681632 1363282 46805 1 <2e-16 ***
## (1 | RatingAreaId:StateCode) 9 -658508 1317033 556 1 <2e-16 ***
## (1 | StateCode) 9 -658230 1316477 0 1 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fixslope_LMM_lv3_v1)
# not much impovement
fixslope_LMM_lv3_v2 <- lmer(Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1|IssuerId) + (1|RatingAreaId), data = rate)
summary(fixslope_LMM_lv3_v2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents +
## BusinessYear + (1 | IssuerId) + (1 | RatingAreaId)
## Data: rate
##
## REML criterion at convergence: 1316948
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.7243 -0.5871 -0.0930 0.4793 21.7949
##
## Random effects:
## Groups Name Variance Std.Dev.
## IssuerId (Intercept) 47731.2 218.5
## RatingAreaId (Intercept) 174.1 13.2
## Residual 41596.6 204.0
## Number of obs: 97489, groups: IssuerId, 966; RatingAreaId, 67
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) -5.472e+04 8.467e+02 9.732e+04 -64.62 <2e-16
## Age 1.007e+01 4.878e-02 9.672e+04 206.50 <2e-16
## ifTobaccoPreferenceY 1.112e+02 2.342e+00 9.654e+04 47.48 <2e-16
## PrimarySubscriberIndividual -1.469e+02 1.108e+01 9.711e+04 -13.27 <2e-16
## ifDependentsY 3.098e+02 9.755e+00 8.959e+04 31.76 <2e-16
## BusinessYear 2.711e+01 4.199e-01 9.731e+04 64.56 <2e-16
##
## (Intercept) ***
## Age ***
## ifTobaccoPreferenceY ***
## PrimarySubscriberIndividual ***
## ifDependentsY ***
## BusinessYear ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Age ifTbPY PrmrSI ifDpnY
## Age -0.069
## ifTbccPrfrY -0.022 0.007
## PrmrySbscrI 0.009 -0.065 -0.003
## ifDepndntsY -0.057 0.153 0.009 0.487
## BusinessYer -1.000 0.068 0.022 -0.021 0.049
lmerTest::ranova(fixslope_LMM_lv3_v2)
## ANOVA-like table for random-effects: Single term deletions
##
## Model:
## Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents +
## BusinessYear + (1 | IssuerId) + (1 | RatingAreaId)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 9 -658474 1316966
## (1 | IssuerId) 8 -685651 1371317 54354 1 < 2.2e-16 ***
## (1 | RatingAreaId) 8 -658508 1317031 68 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fixslope_LMM_lv3_v2)
# not much impovement
fixslope_LMM_lv3_v3 <- lmer(Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1|IssuerId) + (1|StateCode), data = rate)
summary(fixslope_LMM_lv3_v3)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents +
## BusinessYear + (1 | IssuerId) + (1 | StateCode)
## Data: rate
##
## REML criterion at convergence: 1317015
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.6799 -0.5873 -0.0932 0.4808 21.7522
##
## Random effects:
## Groups Name Variance Std.Dev.
## IssuerId (Intercept) 47695 218.4
## StateCode (Intercept) 0 0.0
## Residual 41665 204.1
## Number of obs: 97489, groups: IssuerId, 966; StateCode, 40
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) -5.475e+04 8.472e+02 9.735e+04 -64.62 <2e-16
## Age 1.007e+01 4.880e-02 9.677e+04 206.42 <2e-16
## ifTobaccoPreferenceY 1.110e+02 2.343e+00 9.658e+04 47.38 <2e-16
## PrimarySubscriberIndividual -1.471e+02 1.108e+01 9.713e+04 -13.27 <2e-16
## ifDependentsY 3.102e+02 9.761e+00 8.961e+04 31.78 <2e-16
## BusinessYear 2.712e+01 4.201e-01 9.734e+04 64.56 <2e-16
##
## (Intercept) ***
## Age ***
## ifTobaccoPreferenceY ***
## PrimarySubscriberIndividual ***
## ifDependentsY ***
## BusinessYear ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Age ifTbPY PrmrSI ifDpnY
## Age -0.069
## ifTbccPrfrY -0.023 0.007
## PrmrySbscrI 0.009 -0.065 -0.003
## ifDepndntsY -0.057 0.153 0.008 0.487
## BusinessYer -1.000 0.068 0.022 -0.021 0.049
## convergence code: 0
## boundary (singular) fit: see ?isSingular
lmerTest::ranova(fixslope_LMM_lv3_v3)
## ANOVA-like table for random-effects: Single term deletions
##
## Model:
## Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents +
## BusinessYear + (1 | IssuerId) + (1 | StateCode)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 9 -658508 1317033
## (1 | IssuerId) 8 -682073 1364161 47130 1 <2e-16 ***
## (1 | StateCode) 8 -658508 1317031 0 1 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fixslope_LMM_lv3_v3)
# best model so far
fixslope_LMM_lv3_v4 <- lmer(Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1|IssuerId), data = rate)
summary(fixslope_LMM_lv3_v4)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents +
## BusinessYear + (1 | IssuerId)
## Data: rate
##
## REML criterion at convergence: 1317015
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.6799 -0.5873 -0.0932 0.4808 21.7522
##
## Random effects:
## Groups Name Variance Std.Dev.
## IssuerId (Intercept) 47695 218.4
## Residual 41665 204.1
## Number of obs: 97489, groups: IssuerId, 966
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) -5.475e+04 8.472e+02 9.734e+04 -64.62 <2e-16
## Age 1.007e+01 4.880e-02 9.677e+04 206.42 <2e-16
## ifTobaccoPreferenceY 1.110e+02 2.343e+00 9.658e+04 47.38 <2e-16
## PrimarySubscriberIndividual -1.471e+02 1.108e+01 9.713e+04 -13.27 <2e-16
## ifDependentsY 3.102e+02 9.761e+00 8.961e+04 31.78 <2e-16
## BusinessYear 2.712e+01 4.201e-01 9.734e+04 64.56 <2e-16
##
## (Intercept) ***
## Age ***
## ifTobaccoPreferenceY ***
## PrimarySubscriberIndividual ***
## ifDependentsY ***
## BusinessYear ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Age ifTbPY PrmrSI ifDpnY
## Age -0.069
## ifTbccPrfrY -0.023 0.007
## PrmrySbscrI 0.009 -0.065 -0.003
## ifDepndntsY -0.057 0.153 0.008 0.487
## BusinessYer -1.000 0.068 0.022 -0.021 0.049
lmerTest::ranova(fixslope_LMM_lv3_v4)
## ANOVA-like table for random-effects: Single term deletions
##
## Model:
## Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents +
## BusinessYear + (1 | IssuerId)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 8 -658508 1317031
## (1 | IssuerId) 7 -685770 1371555 54526 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fixslope_LMM_lv3_v4)
I did not find any random slope-intercept model, so the full model remain as fixed slop model. The summary is shown at the following:
list(fixslope_LMM_lv3_v1, fixslope_LMM_lv3_v2, fixslope_LMM_lv3_v3, fixslope_LMM_lv3_v4) %>%
tibble() %>%
setNames('fixedslopemodel') %>%
mutate(formula = fixedslopemodel %>%
purrr::map_chr(. %>% .@call %>% .[2] %>% as.character() )) %>%
mutate(lv2_variance = fixedslopemodel %>%
purrr::map_dbl(. %>% insight::get_variance_residual() %>% .[[1]]%>% as.numeric() )) %>%
mutate(residual_variance = fixedslopemodel %>%
purrr::map_dbl(. %>% insight::get_variance_intercept() %>% as.numeric() %>% sum() )) %>%
mutate(total_variance = lv2_variance + residual_variance) %>%
mutate(ICC_adjusted = fixedslopemodel %>%
purrr::map_dbl(. %>% performance::icc() %>% .[[1]]%>% as.numeric() )) %>%
select(-fixedslopemodel) %>%
kable() %>% kable_styling()
| formula | lv2_variance | residual_variance | total_variance | ICC_adjusted |
|---|---|---|---|---|
| Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1 | IssuerId) + (1 | StateCode/RatingAreaId) | 41212.47 | 48261.93 | 89474.40 | 0.5393937 |
| Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1 | IssuerId) + (1 | RatingAreaId) | 41596.61 | 47905.29 | 89501.89 | 0.5352433 |
| Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1 | IssuerId) + (1 | StateCode) | 41665.14 | 47694.91 | 89360.05 | NA |
| Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1 | IssuerId) | 41665.14 | 47694.91 | 89360.05 | 0.5337386 |
fixslope_LM <- lm(Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear,
data = rate) ; summary(fixslope_LM)
##
## Call:
## lm(formula = Rate ~ Age + ifTobaccoPreference + PrimarySubscriber +
## ifDependents + BusinessYear, data = rate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -530.6 -176.5 -52.6 139.6 4651.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.466e+04 9.575e+02 -36.200 < 2e-16 ***
## Age 1.015e+01 6.522e-02 155.574 < 2e-16 ***
## ifTobaccoPreferenceY 3.331e+02 1.781e+00 187.046 < 2e-16 ***
## PrimarySubscriberIndividual -1.153e+02 1.428e+01 -8.075 6.84e-16 ***
## ifDependentsY 2.483e+02 1.164e+01 21.334 < 2e-16 ***
## BusinessYear 1.714e+01 4.748e-01 36.108 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 97483 degrees of freedom
## (2167 observations deleted due to missingness)
## Multiple R-squared: 0.3825, Adjusted R-squared: 0.3824
## F-statistic: 1.207e+04 on 5 and 97483 DF, p-value: < 2.2e-16
anova(fixslope_LM)
## Analysis of Variance Table
##
## Response: Rate
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 1887697470 1887697470 25024.64 < 2.2e-16 ***
## ifTobaccoPreference 1 2502024051 2502024051 33168.59 < 2.2e-16 ***
## PrimarySubscriber 1 37987482 37987482 503.59 < 2.2e-16 ***
## ifDependents 1 28040030 28040030 371.72 < 2.2e-16 ***
## BusinessYear 1 98349237 98349237 1303.79 < 2.2e-16 ***
## Residuals 97483 7353488144 75434
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2,2)); plot(fixslope_LM)
# compare LM VS LMM using RMSE
anova(fixslope_LMM_lv3_v4, fixslope_LM)
## Data: rate
## Models:
## fixslope_LM: Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents +
## fixslope_LM: BusinessYear
## fixslope_LMM_lv3_v4: Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents +
## fixslope_LMM_lv3_v4: BusinessYear + (1 | IssuerId)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## fixslope_LM 7 1371569 1371636 -685778 1371555
## fixslope_LMM_lv3_v4 8 1317049 1317125 -658517 1317033 54522 1 < 2.2e-16
##
## fixslope_LM
## fixslope_LMM_lv3_v4 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
performance::performance_rmse(fixslope_LMM_lv3_v4)
## [1] 203.1839
performance::performance_rmse(fixslope_LM)
## [1] 274.6432
The linear mixed model has significant lower RMSE than its ordinary linear model.
The residual plot shown the model has issue of heteroscedasticity. One possible solution is to use the generalized version of LMM, for a quick example :
(Note: the GLLM is not successfully converge, need to fix later)
fullmodel_GLLM <- lme4::glmer(Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear + (1|IssuerId), data = rate %>% as.data.frame(), family = Gamma(link="log") )
summary(fullmodel_GLLM)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: Gamma ( log )
## Formula: Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents +
## BusinessYear + (1 | IssuerId)
## Data: rate %>% as.data.frame()
##
## AIC BIC logLik deviance df.resid
## 1182895.0 1182970.9 -591439.5 1182879.0 97481
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.784 -0.485 -0.086 0.379 43.002
##
## Random effects:
## Groups Name Variance Std.Dev.
## IssuerId (Intercept) 1.2112 1.101
## Residual 0.3136 0.560
## Number of obs: 97489, groups: IssuerId, 966
##
## Fixed effects:
## Estimate Std. Error t value Pr(>|z|)
## (Intercept) -6.587e+01 1.080e+00 -60.97 <2e-16 ***
## Age 1.908e-02 1.374e-04 138.86 <2e-16 ***
## ifTobaccoPreferenceY 3.835e-01 6.739e-03 56.91 <2e-16 ***
## PrimarySubscriberIndividual -4.919e-01 3.085e-02 -15.95 <2e-16 ***
## ifDependentsY 1.138e+00 2.703e-02 42.09 <2e-16 ***
## BusinessYear 3.454e-02 5.366e-04 64.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Age ifTbPY PrmrSI ifDpnY
## Age -0.034
## ifTbccPrfrY 0.012 0.004
## PrmrySbscrI 0.011 -0.072 0.000
## ifDepndntsY -0.043 0.135 0.001 0.363
## BusinessYer -0.998 0.031 -0.014 -0.039 0.030
## convergence code: 0
## Model failed to converge with max|grad| = 0.00363449 (tol = 0.001, component 1)
## Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
## Model is nearly unidentifiable: large eigenvalue ratio
## - Rescale variables?
plot(fullmodel_GLLM)
fullmodel_GLM <- glm(Rate ~ Age + ifTobaccoPreference + PrimarySubscriber + ifDependents + BusinessYear,
data = rate, family = Gamma())
summary(fullmodel_GLM)
##
## Call:
## glm(formula = Rate ~ Age + ifTobaccoPreference + PrimarySubscriber +
## ifDependents + BusinessYear, family = Gamma(), data = rate)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0964 -1.4289 -0.1370 0.2854 2.9436
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.582e-01 6.901e-03 22.924 < 2e-16 ***
## Age -6.154e-05 5.864e-07 -104.936 < 2e-16 ***
## ifTobaccoPreferenceY -2.505e-03 2.099e-05 -119.386 < 2e-16 ***
## PrimarySubscriberIndividual -2.401e-03 5.664e-04 -4.239 2.24e-05 ***
## ifDependentsY 4.228e-03 4.558e-04 9.277 < 2e-16 ***
## BusinessYear -7.366e-05 3.412e-06 -21.591 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.9251532)
##
## Null deviance: 165050 on 97488 degrees of freedom
## Residual deviance: 133122 on 97483 degrees of freedom
## (2167 observations deleted due to missingness)
## AIC: 1321925
##
## Number of Fisher Scoring iterations: 7
anova(fullmodel_GLM)
## Analysis of Deviance Table
##
## Model: Gamma, link: inverse
##
## Response: Rate
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 97488 165050
## Age 1 13759.5 97487 151290
## ifTobaccoPreference 1 17505.8 97486 133784
## PrimarySubscriber 1 151.3 97485 133633
## ifDependents 1 104.4 97484 133529
## BusinessYear 1 407.1 97483 133122
car::crPlots(fullmodel_GLM)